{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Pose & PDBinfo\n",
    "PDBinfo是Pose和PDB中信息交换和储存的重要媒介。Pose是通常是从PDB文件中衍生出来的，除了原子的坐标信息以外，PDB文件中含包含了许多额外的信息，而这些信息是储存在PDBinfo中。比如温度因子数据(bfactor)、晶体解析数据(crystinfo)、原子的占用率(occupancy)、Pose编号与PDB编号的转换以及Pose的序列信息等。如果Pose中的氨基酸发生了插入和删除，又或者其他的信息发生了变化，那么这更新的信息就需要再次更新到Pose中。"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### 1. PDB编号与Pose编号"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "在PDB文件中，每个氨基酸都有自己独立的编号，并且氨基酸的PDB编号是依赖于其所在的链ID，如1A，2A...120A，1B，2B...130B等，分别代表A链和B链上的氨基酸位置。\n",
    "而在Pose的概念中，氨基酸的编号是忽略链的分隔，按照PDB文件中，链编写的顺序，**从1开始递增**，因此在Pose中的氨基酸和PDB编号的对应是新手处理相当棘手的，但是我们可以通过PDB_info这个的功能，来轻松解决编号转换的问题。"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 14,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "PyRosetta-4 2020 [Rosetta PyRosetta4.conda.mac.cxx11thread.serialization.python36.Release 2020.23+release.0d6f90a8cb9fa0567ca76bb71ee93bfe73340c70 2020-06-04T19:12:24] retrieved from: http://www.pyrosetta.org\n",
      "(C) Copyright Rosetta Commons Member Institutions. Created in JHU by Sergey Lyskov and PyRosetta Team.\n",
      "\u001b[0mcore.init: {0} \u001b[0mChecking for fconfig files in pwd and ./rosetta/flags\n",
      "\u001b[0mcore.init: {0} \u001b[0mRosetta version: PyRosetta4.conda.mac.cxx11thread.serialization.python36.Release r257 2020.23+release.0d6f90a8cb9 0d6f90a8cb9fa0567ca76bb71ee93bfe73340c70 http://www.pyrosetta.org 2020-06-04T19:12:24\n",
      "\u001b[0mcore.init: {0} \u001b[0mcommand: PyRosetta -ex1 -ex2aro -database /opt/miniconda3/envs/pyrosetta/lib/python3.6/site-packages/pyrosetta/database\n",
      "\u001b[0mbasic.random.init_random_generator: {0} \u001b[0m'RNG device' seed mode, using '/dev/urandom', seed=765754884 seed_offset=0 real_seed=765754884 thread_index=0\n",
      "\u001b[0mbasic.random.init_random_generator: {0} \u001b[0mRandomGenerator:init: Normal mode, seed=765754884 RG_type=mt19937\n",
      "\u001b[0mcore.import_pose.import_pose: {0} \u001b[0mFile './data/4R80.clean.pdb' automatically determined to be of type PDB\n",
      "\u001b[0mcore.conformation.Conformation: {0} \u001b[0m\u001b[1m[ WARNING ]\u001b[0m missing heavyatom:  OXT on residue SER:CtermProteinFull 76\n",
      "\u001b[0mcore.conformation.Conformation: {0} \u001b[0m\u001b[1m[ WARNING ]\u001b[0m missing heavyatom:  OXT on residue SER:CtermProteinFull 152\n"
     ]
    }
   ],
   "source": [
    "# 首先依然是从PDB中读入Pose\n",
    "from pyrosetta import init, pose_from_pdb\n",
    "init()\n",
    "pose = pose_from_pdb('./data/4R80.clean.pdb')\n",
    "\n",
    "# 获取PDBinfo对象:\n",
    "pose_pdbinfo = pose.pdb_info()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "**编号转换的处理主要使用的是pdb_info中的pdb2pose和pose2pdb函数。**\n",
    "- pdb2pose: 负责将pdb编号翻译成pose编号;\n",
    "- pose2pdb: 与pdb2pose反作用;"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 15,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "24\n"
     ]
    }
   ],
   "source": [
    "# 获取PDB号为24A的氨基酸残基所在的Pose残基编号\n",
    "pose_number = pose_pdbinfo.pdb2pose('A', 24)\n",
    "print(pose_number)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 16,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "24 A \n"
     ]
    }
   ],
   "source": [
    "# 获取Pose残基编号为24的氨基酸残基所在的PDB号\n",
    "pdb_number = pose_pdbinfo.pose2pdb(24)\n",
    "print(pdb_number)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": []
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### 2. PDB链信息\n",
    "由于Pose中链也是从1号开始编，不同于PDB文件中链的信息以字母的形式进行储存。链中信息获取的方式有多种途径:\n",
    "- 根据链的PDB chain ID获取Pose chain ID;\n",
    "- 根据链的chain ID获取PDB chain ID;\n",
    "- 根据氨基酸的Pose编号获取其所在的chain ID;\n",
    "- Pose的链氨基酸序列信息;\n",
    "- 获取链起始和末端氨基酸信息;\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 20,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "'A:1-76 B:1-76'"
      ]
     },
     "execution_count": 20,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "# 先来看看pdbinfo中链的基本信息:\n",
    "pose_pdbinfo.short_desc()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "可见我们的这个pose中共有2条链，分别为A和B链，分别都有76个氨基酸"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 22,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "1 2\n"
     ]
    }
   ],
   "source": [
    "# 根据链的PDB chain ID获取Pose chain ID;\n",
    "from pyrosetta.rosetta.core.pose import get_chain_id_from_chain\n",
    "chainA_pose_chain_id = get_chain_id_from_chain('A', pose)\n",
    "chainB_pose_chain_id = get_chain_id_from_chain('B', pose)\n",
    "print(chainA_pose_chain_id, chainB_pose_chain_id)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 24,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "A B\n"
     ]
    }
   ],
   "source": [
    "from pyrosetta.rosetta.core.pose import get_chain_from_chain_id\n",
    "chain1_pdb_chain_id = get_chain_from_chain_id(1, pose)\n",
    "chain2_pdb_chain_id = get_chain_from_chain_id(2, pose)\n",
    "print(chain1_pdb_chain_id, chain2_pdb_chain_id)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 42,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "A B\n"
     ]
    }
   ],
   "source": [
    "# 根据某个氨基酸残基的Pose编号获取其所在的PDB链ID;\n",
    "residue1_chain_id = pose_pdbinfo.chain(1)\n",
    "residue77_chain_id = pose_pdbinfo.chain(77)\n",
    "print(residue1_chain_id, residue77_chain_id)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 30,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "1 76\n"
     ]
    }
   ],
   "source": [
    "# 获取链的的起始和结尾的氨基酸Pose编号:\n",
    "chain1_start_pose_id = pose.chain_begin(1) # 返回某链起始的rosetta index\n",
    "chain1_end_pose_id = pose.chain_end(1) # 返回某链终止的rosetta index\n",
    "print(chain1_start_pose_id, chain1_end_pose_id)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 29,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "'PSEEEEKRRAKQVAKEKILEQNPSSKVQVRRVQKQGNTIRVELEITENGKKTNITVEVEKQGNTFTVKRITETVGS'"
      ]
     },
     "execution_count": 29,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "# 获取链的序列信息:\n",
    "pose.chain_sequence(1) # 返回某链的氨基酸序列"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": []
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### 3. PDB中晶体解析信息的提取\n",
    "除了基本的编号信息以外，一些晶体相关的信息也可以轻松进行提取:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 10,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "49.13"
      ]
     },
     "execution_count": 10,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "# 提取bfactor信息\n",
    "pose.pdb_info().bfactor(1,1)  # 返回温度因子信息"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 11,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "<CrystInfo>{0,0,0,90,90,90 : P 1}\n"
     ]
    }
   ],
   "source": [
    "# 获取PDB的晶体信息\n",
    "crystinfo = pose.pdb_info().crystinfo()\n",
    "print(crystinfo)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 12,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "1.0"
      ]
     },
     "execution_count": 12,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "# 获取原子的occupancy\n",
    "pose.pdb_info().occupancy(1, 1)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": []
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### 进阶技巧. 氨基酸残基PDB LABEL\n",
    "PyRosetta中的PDBinfo除了储存一些实验信息以外，还可以储存用户的自定义信息，比如用户可以通过label系统对一些氨基酸打上标签，在后续的氨基酸范围选取中快捷方便的操作。\n",
    "进阶部分主要介绍:\n",
    "- add_reslabel/get_reslabels/clear_reslabel/res_haslabel\n",
    "- icode/set_resinfo"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "#### 1. 通过pdbinfo的add_reslabel函数对残基打标签:\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 49,
   "metadata": {},
   "outputs": [],
   "source": [
    "# 打标签\n",
    "pose_pdbinfo.add_reslabel(1, 'starts')\n",
    "pose_pdbinfo.add_reslabel(2, 'haha')\n",
    "pose_pdbinfo.add_reslabel(3, 'end')"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 52,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "vector1_std_string[starts]\n",
      "vector1_std_string[haha]\n",
      "vector1_std_string[end]\n"
     ]
    },
    {
     "data": {
      "text/plain": [
       "True"
      ]
     },
     "execution_count": 52,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "# 查标签\n",
    "print(pose_pdbinfo.get_reslabels(1))\n",
    "print(pose_pdbinfo.get_reslabels(2))\n",
    "print(pose_pdbinfo.get_reslabels(3))\n",
    "pose.dump_pdb('reslabeled.pdb')"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "在输出PDB文件之后，在文件的最下方可以看到以下的信息被记录:\n",
    "- REMARK PDBinfo-LABEL:    1 starts\n",
    "- REMARK PDBinfo-LABEL:    2 haha\n",
    "- REMARK PDBinfo-LABEL:    3 end"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 47,
   "metadata": {},
   "outputs": [],
   "source": [
    "# 清除标签\n",
    "pose_pdbinfo.clear_reslabel(1)\n",
    "pose_pdbinfo.clear_reslabel(2)\n",
    "pose_pdbinfo.clear_reslabel(3)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": []
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "#### 2. 关于icode插入与管理\n",
    "在PDB文件中，比如处理抗体结构时，一些特殊的PDB编号会引入icode的概念（insert code）来区别想到氨基酸编号残基的问题。\n",
    "如 1A 2B 3C等(此处的代码并非是链号)。\n",
    "通过PDBinfo，用户也可以很方便的插入这些字符。"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 69,
   "metadata": {},
   "outputs": [],
   "source": [
    "# set_resinfo(res: int, chain_id: str, pdb_res: int, ins_code: str) -> None\n",
    "pose_pdbinfo.set_resinfo(1, 'A', 1, 'm')"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 70,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "m\n"
     ]
    }
   ],
   "source": [
    "print(pose_pdbinfo.icode(1))"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 71,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "True"
      ]
     },
     "execution_count": 71,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "# 保存PDB\n",
    "pose.dump_pdb('insert_icode.pdb')"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "使用文本编辑器打开后可以见，第一个氨基酸的残基编号发生了变化:从1变成了1m\n",
    "\n",
    "ATOM      1  N   PRO A   1m     35.432  -0.708   7.647  1.00 49.13           N  \n",
    "ATOM      2  CA  PRO A   1m     35.959   0.478   8.332  1.00 38.65           C  \n",
    "ATOM      3  C   PRO A   1m     36.620   1.469   7.374  1.00 28.53           C  \n",
    "ATOM      4  O   PRO A   1m     36.946   1.110   6.240  1.00 28.02           O  \n",
    "ATOM      5  CB  PRO A   1m     36.987  -0.100   9.317  1.00 32.04           C  \n",
    "ATOM      6  CG  PRO A   1m     37.119  -1.563   8.970  1.00 40.69           C  \n",
    "ATOM      7  CD  PRO A   1m     35.846  -1.957   8.305  1.00 42.21           C  \n",
    "ATOM      8  HA  PRO A   1m     35.141   0.977   8.872  1.00  0.00           H  \n",
    "ATOM      9 1HB  PRO A   1m     37.944   0.434   9.218  1.00  0.00           H  \n",
    "ATOM     10 2HB  PRO A   1m     36.642   0.048  10.351  1.00  0.00           H  \n",
    "ATOM     11 1HG  PRO A   1m     37.985  -1.720   8.311  1.00  0.00           H  \n",
    "ATOM     12 2HG  PRO A   1m     37.300  -2.154   9.880  1.00  0.00           H  \n",
    "ATOM     13 1HD  PRO A   1m     36.044  -2.759   7.579  1.00  0.00           H  \n",
    "ATOM     14 2HD  PRO A   1m     35.123  -2.291   9.064  1.00  0.00           H  \n",
    "ATOM     15 1H   PRO A   1m     35.759  -0.700   6.700  1.00  0.00           H  \n",
    "ATOM     16 2H   PRO A   1m     34.427  -0.649   7.643  1.00  0.00           H  "
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": []
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "pyrosetta",
   "language": "python",
   "name": "pyrosetta"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.6.10"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 4
}
